R statistical software and associated tools have the potential to increase your efficiency, simplify collaboration, and help you create reproducible products. Often the biggest limitation to these tools is a lack of awareness. This document is intended to provide you with an overview of some of the many R-related tools and is intended for individuals with at least a basic understanding of R. We will learn to create R functions, a robust project structure using RStudio, aesthetically pleasing documentation with R Markdown, and interactive tools with Shiny. In addition, we will delve into the ecosystem of packages developed by RStudio, known as the Tidyverse (e.g., dplyr, tidyr, purrr, and ggplot2), which provide useful and intuitive functions for data manipulation, processing, and visualization. Overall, this document will provide you with working knowledge of some of the most widely used R-related tools available for your next project.
R, like any other skill, is going to require you to invest time in practice. I cannot help you with time management, but I can provide you with 1) an awareness of the tools available and 2) the vocabulary and understanding of syntax to make practicing less frustrating.
A good way to invest more time in R is to use R in situations where you would normally use MS Excel.
R is an open-source programming language developed for statistical computing and graphics. “R can be considered as a different implementation of S”, a language that was developed at Bell Laboratories (https://www.r-project.org/about.html).
R packages are extensions of base R that provide additional features or provide alternative functionality.
Please install R and RStudio by following the links below. I highly recommend that you also install Git (link below) and create an account on GitHub (link below). The Git-related tools are very useful, but some workplaces will not allow their use for security reasons. Therefore, I will briefly cover Git, but you are not required to use these tools during the workshop.
| Software | Link |
|---|---|
| R | https://cran.r-project.org/bin/windows/base/ |
| RStudio | https://www.rstudio.com/products/rstudio/download/#download |
| Git | https://git-scm.com/downloads |
| GitHub | https://github.com |
Run the following code in the RGui, NOT in RStudio. The RGui should be installed when you install R. On my Windows machine, I access R by clicking on the R program file, “R x64 3.5.1”.
You should get a window like this if you have opened the correct program.
This code was copied from https://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/. Make sure RStudio is closed before running this code within the RGui.
# installing/loading the package:
if (!require(installr)) {
  install.packages("installr")
  require(installr)
} # load / install + load installr

# using the package:
updateR()
Please run the following code within RStudio to make sure you have all of the necessary packages for this workshop installed.
package.vec <- c("tidyverse", "lubridate",
                 "knitr", "rmarkdown", "markdown", "caTools", "bitops",
                 "DT", "leaflet", "shiny", "jsonlite",
                 "data.table", "rprojroot", "viridis")
install.packages(package.vec)
Overview
Version control software keeps track of changes made to files. This gives the user the ability to revert files to an earlier version, provides a backup of the files, and simplifies collaboration.
Git is a free, open-source version control system (https://git-scm.com/), and GitHub is a platform that allows the user to store their changes made via Git in the form of repositories (https://github.com/). RStudio has integrated Git into the RStudio IDE, making it easy to work with repositories from GitHub (https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN). Git will need to be installed locally (https://git-scm.com/downloads) and a GitHub account will need to be created (https://github.com/join?source=header-home) before these tools can be accessed from RStudio.
Below are the steps for initializing an R project file connected to a GitHub repository. It is much easier to create the GitHub repository prior to creating the project.
Create a new project in R Studio (File -> New Project).
Click “Create Project.” The project will now be linked to the GitHub repository. A “Git” tab will appear within the “Environment” pane in R Studio. If you have an existing R project without a repository on GitHub and you would like to start using version control, I recommend starting from scratch. Create a GitHub repository, connect to a new R Studio project file, and copy the old R project files into the new R project folder.
Whenever updates are made to the files within the R project folder, they will be queued in the “Git” tab that appears in “Environment” pane in R Studio.
Pull every time you open the project to make sure you have the most up-to-date version of the repository. Changes may have been made by you from a different computer or by one of your collaborators.
A commit is an informative message to yourself or collaborators indicating why you made a change. When the “Commit” button is selected, an “RStudio: Review Changes” window will appear. In this window, all of the altered files will appear in the upper left pane. By selecting an individual file in the upper left pane, the user can see the changes that were implemented in the bottom pane of the “RStudio: Review Changes” window. Deletions will appear in red, while insertions will appear in green. One or more files can be staged, and then the user has three options: 1) Revert, 2) Ignore, or 3) Commit.
Push commits from R Studio to the GitHub repository.
When a repository is created it consists of a single branch labeled “master.” The master branch will suffice as you first develop the app. However, you may reach a point where the master branch is functioning well (without any known issues) but you want to make some dramatic development changes. Rather than committing the changes to the master branch (potentially breaking your working product), you can create a new isolated branch to work on your development changes. In this case, the branch would clone the current state of the master, and then any edits made to the new branch would not impact the master branch.
When working in R Studio with GitHub, use the drop-down menu (located in the top right corner of the Git tab in the Environment pane) to select the branch you want to work on. In the image below I clicked on “master” and now I can see three branches are available for this project: 1) master, 2) Development, and 3) Zsmith. Simply select a name to change the branch you are working on.
Creating a new branch is relatively simple. There are three ways that I know how to create a repository branch: 1) via R Studio, 2) via GitHub, and 3) via Git Shell.
In the “Git” tab of the Environment pane in RStudio, there is a button for creating a new branch. Click on this button:
Log on to your GitHub account (https://github.com/) and navigate to the repository you are working on. Under the “Code” tab, click on the “Branch:” button. This will produce a drop-down menu (as seen in the image below), where you can select an existing branch or create a new branch by typing the name you want to assign your new branch into the input box labeled “Find or create a branch.” Once you type in the new name, a new box will appear in the drop-down menu that says “Create Branch”. Click on this box to create the new branch.
The Git Shell can be accessed via R Studio in the Git tab of the Environment pane. Click on the “More” dropdown menu and then click “Shell…” (see image below).
A new window will appear. Use this link to understand how to create and work with branches in the Git Shell: https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging.
After you have vetted a new branch, you can merge the new branch with the master branch (or some other branch). The merge will join all the changes made in the new branch and all of the changes made in the master branch. You may run into conflict issues if both branches updated the same section of code (https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/).
Merging branches is not as simple as creating branches. As far as I know, branches can only be merged using the Git Shell (see Git Shell to learn how to access the Git Shell). Use the following link to understand how to merge branches: https://help.github.com/articles/merging-an-upstream-repository-into-your-fork/
Free Book: R Markdown: The Definitive Guide
RStudio Lessons: https://rmarkdown.rstudio.com/lesson-1.html
Markdown is a markup language for developing and formatting documents. R Markdown is an R package that allows the user to integrate text, R code, and R code output into a well-formatted document (e.g., HTML, MS Word, PDF).
My recommendation is to create an R Markdown file for every R-project. The intention is to document as much of the project as possible. R Markdown provides a more readable document, with better descriptions of how and why an activity was performed, than a standard R script with a few commented lines.
Use markdown syntax, some of which is shown in the table below, to format the document.
Once the document is complete (formatted with markdown syntax and integrated R code), the document can be knit (rendered) using the package knitr.
Here is a simple example showing the raw R Markdown (Rmd) file before knitting (rendering) and after knitting. The colors on the far left are there to help identify elements pre- and post-knitting.
R is not the only language supported by R Markdown. The following languages and more can be integrated into an R Markdown file.
---
title: "Untitled"
author: "Zachary M. Smith"
date: "September 23, 2018"
output: html_document
---
Again, your best resource for learning how to use R Markdown will be the R Markdown website (https://rmarkdown.rstudio.com/lesson-1.html), but I will describe some of the general features here.
YAML: YAML Ain’t Markup Language
Heading text follows one or more hash signs (#). The number of hash signs determines the hierarchy of headings. For example, “# Heading 1” would represent the primary heading, “## Heading 2” would represent the secondary heading, “### Heading 3” would represent the tertiary heading, and so forth.
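For example, these three lines of markdown produce three heading levels:

```markdown
# Heading 1
## Heading 2
### Heading 3
```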
Simply add text below the YAML header.
To insert a code chunk, press Ctrl + Alt + i in the source pane (top left pane in the default settings of RStudio). A code chunk will appear:
Inside the code chunk you can write and run R-code. If you print the output of your R-code it will appear below the code chunk in the source pane and the printed output will appear in the final compiled document. This is useful for producing figures and tables.
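As a sketch, a chunk that prints output might look like this in the Rmd source (summary(cars) is just a placeholder example; cars is a data set that ships with R):

````markdown
```{r}
# R code here is executed when the document is knit; the printed
# output appears below the chunk in the final compiled document.
summary(cars)
```
````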
To view the html document, you must compile the document using Knit. Follow these steps to Knit the document:
Find and click the Knit button (it looks like a ball of yarn) in the toolbar above the editor window.
The compiled file will be saved in the same directory as your Rmd file (your R Markdown file).
I store the R Markdown file(s) in a sub-directory labeled “markdown” within the R-project folder (rproject/markdown).
In general, I find that a single R Markdown file quickly becomes unwieldy. I recommend breaking the document up into multiple “child” documents and sourcing these child documents in a parent document. My child documents generally represent major subsections of the document.
I store the parent R Markdown file in the “markdown” folder (rproject/markdown) and the child R Markdown files in a sub-directory of my “markdown” folder called “sections” (rproject/markdown/sections). In the parent file, the child files are sourced within the code chunk header using “child = ‘sections/example.Rmd’”. After sourcing all the child chunks, the parent file can be knit (compiled) like a normal R Markdown document. The child documents cannot be run in the parent file.
The parent file is great for organizing sections of your document, but the child documents cannot be executed within R Studio like a normal code chunk. Without the ability to easily execute the R code within the child documents it can become very difficult to develop new child documents because new child documents often depend on upstream code execution.
Imagine you have a parent document that sources child sections which import your data and clean your data. You now want to visualize your data; accordingly, you begin to develop a visualization child document, which depends on information from the upstream child sections. It would be inefficient and inappropriate to perform all the steps in the upstream child sections within the visualization section. Therefore, you need an effective way to execute the upstream child sections while you continue to develop the visualization section. The inefficient way of doing this is to open each child Rmd file in R Studio and execute them manually in the correct sequence. This becomes tedious after you have three or more documents (imagine doing this for 10+ child sections). The most efficient way that I have found to run upstream child sections is to extract the R-code chunks from each Rmd file, save them in a “raw_scripts” folder, and then source/execute the scripts within a regular R script file (.R).
In this section we establish the file path to the folder that contains all the child documents. The names of the child documents are extracted and stored as a vector. The grepl() function is used to retain only the Rmd files stored in the vector.
sections.path <- "notebook/sections"
r.files.vec <- list.files(sections.path)
r.files.vec <- r.files.vec[grepl(".Rmd", r.files.vec, fixed = TRUE)]
Next, a file path is specified for the R-scripts that will be extracted from the R Markdown documents; I place these files within a “raw_script/extracted” folder. The map() function from the purrr package is used to loop through each file in the vector (r.files.vec). Within the map() loop, the purl() function from knitr is used to extract the R-code from the R Markdown documents and save the code to the specified folder.
extracted.path <- "notebook/raw_script/extracted"
purrr::map(r.files.vec, function(file.i) {
  file.name <- gsub(".Rmd", "", file.i, fixed = TRUE)
  extracted.file <- paste0(file.name, ".R")
  knitr::purl(
    file.path(sections.path, file.i),
    file.path(extracted.path, extracted.file)
  )
})
Finally, create a vector of file names (source.vec) stored in the “raw_script/extracted” folder. You will want to type these out manually (do not use the list.files() function) because in this format you can easily comment out certain scripts and only run the scripts of interest. map() is then used to loop through each specified file in source.vec. Keep in mind that the order of the file names specified in source.vec will determine the order in which these files are executed in the map() function; therefore, order the files in source.vec from furthest upstream to furthest downstream. Each iteration of the loop executes (sources) the specified R-script.
source.vec <- c(
  "import_data.R",
  "clean_data.R",
  "visualize_data.R"
)

purrr::map(source.vec, function(source.i) {
  source(file.path(extracted.path, source.i))
})
Once all the R-scripts extracted from the upstream child R Markdown files have been executed, you can begin or continue work on a new child R Markdown document. I keep all the above code in a single R-script and execute the entire script each time I use this file to make sure all of the files are up-to-date.
is.numeric(c(1, 2, 3.5, 6.7))
## [1] TRUE
is.numeric(c(1, 2, 3, 6))
## [1] TRUE
is.numeric(1:5)
## [1] TRUE
is.numeric(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.numeric(c("a", 2, 3, 6))
## [1] FALSE
is.numeric(c("a", "b", "c", "d"))
## [1] FALSE
From the is.integer() help file: “is.integer(x) does not test if x contains integer numbers!” EXAMPLE:
is.integer(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.integer(c(1, 2, 3, 6))
## [1] FALSE
is.integer(1:5)
## [1] TRUE
is.integer(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.integer(c("a", 2, 3, 6))
## [1] FALSE
is.integer(c("a", "b", "c", "d"))
## [1] FALSE
is.logical(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.logical(c(1, 2, 3, 6))
## [1] FALSE
is.logical(1:5)
## [1] FALSE
is.logical(c(TRUE, FALSE, TRUE))
## [1] TRUE
is.logical(c("a", 2, 3, 6))
## [1] FALSE
is.logical(c("a", "b", "c", "d"))
## [1] FALSE
is.character(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.character(c(1, 2, 3, 6))
## [1] FALSE
is.character(1:5)
## [1] FALSE
is.character(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.character(c("a", 2, 3, 6))
## [1] TRUE
is.character(c("a", "b", "c", "d"))
## [1] TRUE
is.factor(c(1, 2, 3.5, 6.7))
## [1] FALSE
is.factor(c(1, 2, 3, 6))
## [1] FALSE
is.factor(1:5)
## [1] FALSE
is.factor(c(TRUE, FALSE, TRUE))
## [1] FALSE
is.factor(c("a", 2, 3, 6))
## [1] FALSE
is.factor(c("a", "b", "c", "d"))
## [1] FALSE
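None of the vectors above are factors; a vector must be explicitly converted with factor() before is.factor() returns TRUE:

```r
# Convert a character vector to a factor and test its class.
fact.vec <- factor(c("a", "b", "c", "d"))
is.factor(fact.vec)
## [1] TRUE
```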
Example of sorting a character vector.
sort(c("a", "b", "c", "d"))
## [1] "a" "b" "c" "d"
Example of sorting a factor vector.
sort(factor(c("a", "b", "c", "d"),
levels = c("d", "a", "c", "b")))
## [1] d a c b
## Levels: d a c b
as.numeric(sort(factor(
c("a", "b", "c", "d"),
levels = c("d", "a", "c", "b")
)))
## [1] 1 2 3 4
In this example there are three stations on the Hudson River (i.e., Port of Albany, West Point, and Pier 84).
Although this is a tidal river, we would generally want to sort and plot this data from upstream to downstream (i.e., Port of Albany, West Point, and Pier 84). The table below shows how sort() would arrange the data if it is stored as a character vs. factor type.
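A sketch of that comparison in code (station names taken from the text; the factor levels are ordered upstream to downstream):

```r
stations.vec <- c("Pier 84", "West Point", "Port of Albany")

# Character vectors sort alphabetically:
# "Pier 84", "Port of Albany", "West Point"
sort(stations.vec)

# Factors sort by their levels, here upstream to downstream:
# "Port of Albany", "West Point", "Pier 84"
stations.fac <- factor(stations.vec,
                       levels = c("Port of Albany", "West Point", "Pier 84"))
sort(stations.fac)
```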
Common Data Structures:
Numeric vector
c(1, 2, 3, 4, 5)
## [1] 1 2 3 4 5
Character Vector
c("A", "B", "C", "D", "E")
## [1] "A" "B" "C" "D" "E"
Logical vector
c(TRUE, FALSE, TRUE, FALSE, TRUE)
## [1] TRUE FALSE TRUE FALSE TRUE
matrix(c(1:5, 2, 7, 4, 9, 20, 40, 23, 64, 67, 80, 3, 76, 29, 59, 91),
ncol = 4)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 40 3
## [2,] 2 7 23 76
## [3,] 3 4 64 29
## [4,] 4 9 67 59
## [5,] 5 20 80 91
array(c(c(1:5, 2, 7, 4, 9, 20, 40, 23, 64, 67, 80, 3, 76, 29, 59, 91, 1, 3, 5, 7, 9),
c(2, 3, 34, 45, 57, 26, 74, 42, 91, 20, 1, 0, 82, 31, 45, 1, 0, 1, 0, 1, 9:5),
c(1, 0, rep(1, 3), rep(0, 5), 1, 0, 1, 1, 0, rep(1, 5), 0, 0, 1, 1, 0)),
c(5, 5, 3))
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 40 3 1
## [2,] 2 7 23 76 3
## [3,] 3 4 64 29 5
## [4,] 4 9 67 59 7
## [5,] 5 20 80 91 9
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 26 1 1 9
## [2,] 3 74 0 0 8
## [3,] 34 42 82 1 7
## [4,] 45 91 31 0 6
## [5,] 57 20 45 1 5
##
## , , 3
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 1 1 0
## [2,] 0 0 0 1 0
## [3,] 1 0 1 1 1
## [4,] 1 0 1 1 1
## [5,] 1 0 0 1 0
data.frame(
  Numeric = c(1, 2, 3, 4, 5),
  Character = c("A", "B", "C", "D", "E"),
  Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
## Numeric Character Logical
## 1 1 A TRUE
## 2 2 B FALSE
## 3 3 C TRUE
## 4 4 D FALSE
## 5 5 E TRUE
list(
c(TRUE, FALSE, TRUE, FALSE, TRUE),
matrix(c("A", "B", "C", "D", "E",
"AB", "BC", "CD", "DE", "EF",
rep("BLAH", 5),
"TEXT", "text", "TEXT", "text", "TEXT"),
ncol = 4),
array(c(c(1:5, 2, 7, 4, 9, 20, 40, 23, 64, 67, 80, 3, 76, 29, 59, 91, 1, 3, 5, 7, 9),
c(2, 3, 34, 45, 57, 26, 74, 42, 91, 20, 1, 0, 82, 31, 45, 1, 0, 1, 0, 1, 9:5),
c(1, 0, rep(1, 3), rep(0, 5), 1, 0, 1, 1, 0, rep(1, 5), 0, 0, 1, 1, 0)),
c(5, 5, 3)),
data.frame(
Numeric = c(1, 2, 3, 4, 5),
Character = c("A", "B", "C", "D", "E"),
Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
stringsAsFactors = FALSE
)
)
## [[1]]
## [1] TRUE FALSE TRUE FALSE TRUE
##
## [[2]]
## [,1] [,2] [,3] [,4]
## [1,] "A" "AB" "BLAH" "TEXT"
## [2,] "B" "BC" "BLAH" "text"
## [3,] "C" "CD" "BLAH" "TEXT"
## [4,] "D" "DE" "BLAH" "text"
## [5,] "E" "EF" "BLAH" "TEXT"
##
## [[3]]
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 40 3 1
## [2,] 2 7 23 76 3
## [3,] 3 4 64 29 5
## [4,] 4 9 67 59 7
## [5,] 5 20 80 91 9
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 26 1 1 9
## [2,] 3 74 0 0 8
## [3,] 34 42 82 1 7
## [4,] 45 91 31 0 6
## [5,] 57 20 45 1 5
##
## , , 3
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 1 1 0
## [2,] 0 0 0 1 0
## [3,] 1 0 1 1 1
## [4,] 1 0 1 1 1
## [5,] 1 0 0 1 0
##
##
## [[4]]
## Numeric Character Logical
## 1 1 A TRUE
## 2 2 B FALSE
## 3 3 C TRUE
## 4 4 D FALSE
## 5 5 E TRUE
The assignment operator (<-) is used to assign data structures to a named object that will be stored in the Environment. The name on the left side of the assignment operator refers to a new or existing data object. The expression on the right side of the assignment operator represents the data that we want to store in the object on the left. When the code is executed, the object will be stored in the Environment, which is visible within the RStudio Environment window.
For example, the integer vector from 1-5, created using c(1:5), is stored with the object name int.vec. The image below this code chunk shows the Environment before and after the code has been executed. We can see that int.vec has been added to the Environment.
int.vec <- c(1:5)
Here is an example of a data frame being stored as an object, example.df. See [Environment Tab] section to review the helpful tools available in RStudio for viewing elements of a data frame.
example.df <- data.frame(
Numeric = c(1, 2, 3, 4, 5),
Character = c("A", "B", "C", "D", "E"),
Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
stringsAsFactors = FALSE
)
char.vec <- c("a", "b", "c", "d", "e")
char.vec
## [1] "a" "b" "c" "d" "e"
char.vec[3]
## [1] "c"
char.vec[c(1, 3)]
## [1] "a" "c"
The colon operator can be helpful for grabbing a sequential range. Here I am specifying elements 1, 2, and 3 of char.vec.
char.vec[1:3]
## [1] "a" "b" "c"
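Square brackets also accept negative indices, which drop elements rather than select them (a small aside not shown above, but often useful):

```r
char.vec <- c("a", "b", "c", "d", "e")

# A negative index drops the element at that position.
char.vec[-1]
## [1] "b" "c" "d" "e"

# Drop elements 1 through 3.
char.vec[-(1:3)]
## [1] "d" "e"
```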
This data frame will be used as an example in this section.
example.df <- data.frame(
Numeric = c(1, 2, 3, 4, 5),
Character = c("A", "B", "C", "D", "E"),
Logical = c(TRUE, FALSE, TRUE, FALSE, TRUE),
stringsAsFactors = FALSE
)
knitr::kable(example.df)
| Numeric | Character | Logical |
|---|---|---|
| 1 | A | TRUE |
| 2 | B | FALSE |
| 3 | C | TRUE |
| 4 | D | FALSE |
| 5 | E | TRUE |
When dealing with data frames, think of the square brackets ([, ]) as a coordinate system. Values specified to the left of the comma in [, ] represent rows, while values to the right of the comma represent columns.
example.df[, 1]
## [1] 1 2 3 4 5
example.df[1, ]
## Numeric Character Logical
## 1 1 A TRUE
example.df[1, 1]
## [1] 1
example.df[1:2, 1:2]
## Numeric Character
## 1 1 A
## 2 2 B
example.df$Numeric
## [1] 1 2 3 4 5
example.df[, "Numeric"]
## [1] 1 2 3 4 5
example.df[example.df$Numeric == 1, ]
## Numeric Character Logical
## 1 1 A TRUE
example.df[example.df$Character %in% c("A", "C"), ]
## Numeric Character Logical
## 1 1 A TRUE
## 3 3 C TRUE
example.df[example.df$Character %in% c("A", "C"), "Numeric"]
## [1] 1 3
Code style is more important than you may first imagine. Adopting a consistent style will make it easier for you and your collaborators to read and comprehend your code. Please review, and use in your future R code, Hadley Wickham’s style guide (http://style.tidyverse.org/index.html). As mentioned in Hadley Wickham’s guide, it is adapted from Google’s style guide (https://google.github.io/styleguide/Rguide.xml); therefore, there are many similarities. I do not want to recreate these style guides here; instead, I want to highlight what I believe are some of the more important features.
File names, function names, and column names should not contain spaces. It is very easy to create a name with two consecutive spaces by mistake and very frustrating to later troubleshoot why your call to this name in your R code is returning an error.
I use snake case, “snake_case”, instead of “snake case” or “snakeCase”; the latter, “snakeCase”, is known as camel case. Many programmers use camel case, but I find I make more typos when I use this naming scheme, and I find snake case easier to read. Please use snake case.
For object names, I prefer a style similar to the one found in Google’s style guide but with a descriptive suffix. I cannot provide a reference for this style, but at one point I adopted a descriptive suffix that describes the object’s class (e.g., data frame = “.df”, vector = “.vec”, scalar = “.scal”, list = “.lst”, matrix = “.mat”, and time series = “.ts”). I have found this simple naming scheme to be very helpful because I immediately know the intended class of the object at any point that it is referenced in the script. This makes it easier to identify a bug, for example when an object named “object.df” is represented in my RStudio Environment pane as a vector or list.
Examples:
# Data Frame-------------------------------------------------------------------
my.df <- data.frame()
# Vector-----------------------------------------------------------------------
my.vec <- c("a", "b", "c")
my.scal <- "a"
my.vec <- c(1:3)
my.scal <- 1
# Matrix-----------------------------------------------------------------------
my.mat <- matrix()
# List-------------------------------------------------------------------------
my.lst <- list()
# Time Series------------------------------------------------------------------
my.ts <- ts()
Please follow the spacing (http://style.tidyverse.org/syntax.html#spacing) and indenting (http://style.tidyverse.org/syntax.html#indenting) guidelines provided in Hadley Wickham’s style guide. I find it very difficult to follow code that does not adhere to these guidelines.
The following “good” and “bad” examples will create the exact same data frame. However, the “good” example is much easier to read and interpret.
Good Example:
good.df <- data.frame(
alphabet = letters,
square_root = sqrt(81),
add = 1 + 1,
subtract = 1 - 1,
multiply = 2 * 2,
divide = 2 / 2,
power = 2^2
)
Bad Example:
bad.df<-data.frame(alphabet=letters,square_root=sqrt(81),add=1+1,subtract=1-1,multiply=2*2,divide=2/2,power=2^2)
It is highly recommended that you learn to write your own functions. In general, anytime you have the urge to copy and paste code to perform a similar process on a different aspect of your data you should instead write a function. A function is a single standardized process that you apply to similar data sets. Therefore, it is easier to update the function rather than multiple chunks of code that have been copied and pasted throughout your script. For example, imagine you have copied and pasted code ten times. Later you want to update the code. Now you have to update all ten instances of the code. This is time consuming and it would be easy to miss or add a typo to one of the ten instances. If the process is stored as a function, you make one update to the function and that update is applied in a standardized manner to all ten instances of your data.
The image below provides an overview of the pieces necessary to create a function. Use function() to initialize the creation of a new function. Within function() you will, generally, add arguments (in the example below: x, y) that you will want to be able to modify. Curly brackets ({}) following the function() call define the extent of the body of the function being created. Within the curly brackets is where most of the work is performed; this is the area where we insert the code that we want to become a standardized process. This example is very simple, just x - y, but you could insert 100+ lines of code into a function. However, really lengthy functions are poor practice. Finally, the function needs to be stored as an object. Use the assignment operator (<-) to assign this function an object name (e.g., subtract <-).
Let us actually create this function.
subtract <- function(x, y) {
  x - y
}
Once this function has been executed, it is stored as an object named “subtract”. This object can be seen in the Environment window.
Now we can test our function.
subtract(x = 10, y = 5)
## [1] 5
We can get the same results without specifically calling out x and y. R will assume that you understand the order of the arguments is x, y; therefore, 10 will represent x and 5 will represent y.
subtract(10, 5)
## [1] 5
If we reverse the order, then 5 represents x and 10 represents y.
subtract(5, 10)
## [1] -5
Our intention with the created subtract function is to subtract one value from another. However, the code in the chunk below will run. In this code x represents the numeric vector c(1, 2, 3) and y represents the numeric vector c(3, 2, 1).
subtract(c(1, 2, 3), c(3, 2, 1))
## [1] -2 0 2
We only want this function to execute if x and y are each of length one. A conditional if() statement combined with a stop() function can be used to prevent the function from working on data that does not meet our requirements (x of length 1 and y of length 1). Note that this update would normally be made to the original function call up above. subtract() is only updated here to show the progression of our development of this function.
subtract <- function(x, y) {
  if (length(x) != 1 | length(y) != 1) stop("x and y must be length 1")
  x - y
}
Now if a vector of length 3 is provided for the x or y arguments, the function will return an error with the message “x and y must be length 1.” This is the message that was defined in the stop() call within the creation of the subtract() function above.
subtract(c(1, 2, 3), c(3, 2, 1))
## Error in subtract(c(1, 2, 3), c(3, 2, 1)): x and y must be length 1
Double check that the original values still work.
subtract(10, 5)
## [1] 5
Exercise:
1. Create a mult() function, similar to subtract() above, that returns 6 if x = 1, y = 2, z = 3.
2. Run the mult() function with:
   - x = c(1, 2), y = 2, z = 3
   - x = 1, y = c(1, 2), z = 3
   - x = 1, y = 2, z = c(1, 2)
3. Which message will you receive if x = 1, y = c(1, 2), z = c(1, 2)? Why is this the only message that is returned?
4. Use paste() (?paste) and substitute() within the paste() call to return the name of the argument supplied (i.e., “x”, “y”, or “z”), and add this check_length() to your mult() function.
5. Test the mult() function on:
   - x = c(1, 2), y = 2, z = 3
   - x = 1, y = c(1, 2), z = 3
   - x = 1, y = 2, z = c(1, 2)
6. Double check that mult(x = 1, y = 2, z = 3) still returns 6.
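For reference, here is one possible sketch of the check_length()/mult() pair the exercise describes (the names come from the exercise; many other solutions work):

```r
# check_length() stops with a message naming the offending argument.
# substitute() captures the unevaluated argument expression, and paste()
# coerces that symbol to its name (e.g., "x").
check_length <- function(x) {
  if (length(x) != 1) {
    stop(paste(substitute(x), "must be length 1"))
  }
}

# mult() multiplies three scalars after validating each argument.
# The checks run in order, so only the first failing argument is reported.
mult <- function(x, y, z) {
  check_length(x)
  check_length(y)
  check_length(z)
  x * y * z
}

mult(x = 1, y = 2, z = 3)
## [1] 6
```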
Link: https://www.tidyverse.org/
Great Place to Start: https://r4ds.had.co.nz/
The tidyverse is an ecosystem of packages that work well together and often make it easier to deal with data in R. These packages are mostly developed and maintained by RStudio staff but contributions are frequently made by members of the R community.
WARNING: These packages are very helpful, but many are not yet stable. There is potential that a function could undergo a large change or be deprecated in the future. This can have a significant negative impact on the reproducibility of your work. The functions in tidyverse packages will generally provide helpful messages indicating if a function has been deprecated and indicate the function that has taken its place.
Instead of loading each tidyverse package individually, library(tidyverse) will load the most frequently used packages: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.0
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked _by_ '.GlobalEnv':
##
## subtract
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
This vector will be used as an example for this section. Notice that all of the strings are slightly different. I frequently find issues like this in data sets I have been given. We will assume that all of the strings are intended to represent the same information, and therefore require some processing to standardize the strings.
x <- c("example 1",
" example 1",
"example 1 ",
"example_1")
I often find that the pipe operator, %>%, from the magrittr package is confusing to those unfamiliar with the tidyverse. It takes a little while to wrap your head around the pipe operator but once you do I think you will find its use makes R code more legible. In essence, base R works from the inside-out, while the pipe operator presents the code in a linear fashion.
For example, imagine you have a character vector x and you want to trim leading/trailing white space, convert all characters to lowercase, and then replace spaces with underscores. In base R, the code would look like the code in the chunk below. Again, base R works from the inside-out, so first trimws() is used to remove leading/trailing white space, then tolower() is used to convert all characters to lowercase, and finally str_replace() is used to replace any spaces with an underscore.
str_replace(tolower(trimws(x)), " ", "_")
## [1] "example_1" "example_1" "example_1" "example_1"
Using the pipe operator, the code functions just the same but it is formatted in a more legible manner:
1. x is piped to trimws()
2. trimws() is piped to tolower()
3. tolower() is piped to str_replace()
4. str_replace() is returned
The pipe operator presents the functions in the order you intend them to be performed; therefore, the code should be easier to read and more intuitive, which should in turn reduce errors.
x %>%
trimws() %>%
tolower() %>%
str_replace(" ", "_")
## [1] "example_1" "example_1" "example_1" "example_1"
Most tidyverse packages (a notable exception being ggplot2) are designed to work well with the pipe operator. The pipe operator pushes the object on the left side of the operator into the first argument position of the function on the right side of the operator. Therefore, functions developed outside of the tidyverse may not automatically work well with the pipe operator. A . can be used as a placeholder.
In the examples above, str_replace(), from the stringr package, was used. This function is designed to work well with the pipe operator but what if we wanted to use gsub(), the base R equivalent to str_replace()?
str_replace() structure: str_replace(string, pattern, replacement)
gsub() structure: gsub(pattern, replacement, x)
In gsub(), the x argument is equivalent to the string argument in str_replace(). Notice the difference in the order of the arguments. In str_replace(), string is the first argument, but in gsub(), x is the last argument. Therefore, when using the pipe operator with gsub(), the placeholder (.) must be used because the output from x %>% trimws() %>% tolower() by default will be piped into the pattern argument, the first argument.
Here is the error received if the placeholder (.) is NOT used:
x %>%
trimws() %>%
tolower() %>%
gsub()
## Error in gsub(.): argument "x" is missing, with no default
Here is how the placeholder (.) would be used in a gsub() call. pattern and replacement are specified as normal, while x = . indicates where the output from x %>% trimws() %>% tolower() should be used.
x %>%
trimws() %>%
tolower() %>%
gsub(pattern = " ",
replacement = "_",
x = .)
## [1] "example_1" "example_1" "example_1" "example_1"
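The same piping pattern works with other base R functions whose first argument is the data, such as toupper() and unique(). A quick sketch, reusing the x vector from above:

```r
library(magrittr)

x <- c("example 1",
       " example 1",
       "example 1 ",
       "example_1")

x %>%
  trimws() %>%   # remove leading/trailing white space
  toupper() %>%  # convert all characters to uppercase
  unique()       # keep only the unique strings
## [1] "EXAMPLE 1" "EXAMPLE_1"
```

Because trimws(), toupper(), and unique() all take the data as their first argument, no placeholder (.) is needed.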
library(magrittr)
toupper()
unique()
suppressPackageStartupMessages(
library(dplyr)
)
Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.
taxa.df <- file.path("data",
"zms_thesis-macro_2017-06-18.csv") %>%
read.csv(stringsAsFactors = FALSE)
DT::datatable(taxa.df, options = list(scrollX = TRUE))
env.df <- file.path("data",
"zms_thesis-env_2017-06-18.csv") %>%
read.csv(stringsAsFactors = FALSE)
DT::datatable(env.df, options = list(scrollX = TRUE))
In the example below, the columns lat and long are renamed to latitude and longitude, respectively. If names(taxa.df) is called, elements 5 and 6 are now represented by “latitude” and “longitude”, respectively.
taxa.df <- taxa.df %>%
rename(latitude = lat,
longitude = long)
names(taxa.df)
## [1] "unique_id" "lake" "station_id" "sample_number"
## [5] "latitude" "longitude" "agency_code" "method"
## [9] "date" "count" "life_stage" "final_id"
## [13] "taxon_level" "phylum" "subphylum" "class"
## [17] "subclass" "order" "suborder" "family"
## [21] "subfamily" "tribe" "genus" "species"
## [25] "picked" "squares"
In the example below, taxa.df is subset to only include rows where the unique_id column matches the string “caz_1_1”.
filter.df <- taxa.df %>%
filter(unique_id == "caz_1_1")
DT::datatable(filter.df, options = list(scrollX = TRUE))
You can apply multiple filters separated by commas. The filters are applied from the top down. In the example below, two filters are applied within the filter() call:
1. unique_id == "caz_1_1"
2. order %in% c("ephemeroptera", "trichoptera")
The first logical statement is the same as the filter from the code chunk above, which keeps only rows associated with the sample "caz_1_1". Then, the order column is used to subset the data to only include rows represented by ephemeroptera (mayfly) or trichoptera (caddisfly) taxa.
filter.df <- taxa.df %>%
filter(unique_id == "caz_1_1",
order %in% c("ephemeroptera", "trichoptera"))
DT::datatable(filter.df, options = list(scrollX = TRUE))
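Note that comma-separated conditions in filter() are combined with a logical AND, so the chunk above is equivalent to the following sketch using a single logical statement:

```r
library(dplyr)

# equivalent to filter(unique_id == "caz_1_1",
#                      order %in% c("ephemeroptera", "trichoptera"))
filter.df <- taxa.df %>%
  filter(unique_id == "caz_1_1" &
           order %in% c("ephemeroptera", "trichoptera"))
```

The comma form is generally preferred in tidyverse code because each condition sits on its own line, which keeps long filters legible.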
Often we are not interested in all columns in a data frame. select() can be used to subset the columns to only include columns of interest. In the example below, the filter.df data frame, created in the Filter section, is subset to only include three columns: unique_id, final_id, and count.
select.df <- filter.df %>%
select(unique_id, final_id, count)
knitr::kable(select.df)
| unique_id | final_id | count |
|---|---|---|
| caz_1_1 | baetidae | 1 |
| caz_1_1 | caenis | 14 |
| caz_1_1 | agraylea | 1 |
| caz_1_1 | oxyethira | 2 |
| caz_1_1 | hydroptilidae | 1 |
The same operation can be performed by chaining the functions together with the pipe operator. taxa.df is first filtered to only include rows that meet our specified logical statements and then only the columns unique_id, final_id, and count are retained.
select.df <- taxa.df %>%
filter(unique_id == "caz_1_1",
order %in% c("ephemeroptera", "trichoptera")) %>%
select(unique_id, final_id, count)
knitr::kable(select.df)
| unique_id | final_id | count |
|---|---|---|
| caz_1_1 | baetidae | 1 |
| caz_1_1 | caenis | 14 |
| caz_1_1 | agraylea | 1 |
| caz_1_1 | oxyethira | 2 |
| caz_1_1 | hydroptilidae | 1 |
Reorder columns. Maybe you would prefer to see the columns in the following order:
1. final_id
2. unique_id
3. count
reorder.df <- taxa.df %>%
filter(unique_id == "caz_1_1",
order %in% c("ephemeroptera", "trichoptera")) %>%
select(final_id, unique_id, count)
knitr::kable(reorder.df)
| final_id | unique_id | count |
|---|---|---|
| baetidae | caz_1_1 | 1 |
| caenis | caz_1_1 | 14 |
| agraylea | caz_1_1 | 1 |
| oxyethira | caz_1_1 | 2 |
| hydroptilidae | caz_1_1 | 1 |
everything() specifies all columns not already specified in the select() call. For example, maybe we want to see the first three columns reordered but the remaining columns to stay in the same order. Without everything(), we would need to specify each column in the select() call. With everything(), we can specify the first three columns to be lake, station_id, and sample_number followed by everything(). The columns will then be reordered accordingly (lake, station_id, sample_number, unique_id, etc.).
reorder.df <- taxa.df %>%
filter(unique_id == "caz_1_1",
order %in% c("ephemeroptera", "trichoptera")) %>%
select(lake, station_id, sample_number, everything())
DT::datatable(reorder.df, options = list(scrollX = TRUE))
After subsetting columns with select(), it may be beneficial to subsequently run distinct() to remove any duplicate rows.
Maybe we are just interested in viewing basic sample information such as lake, station_id, and sample_number. If select() is used to subset the data frame to only these columns, then there will be many duplicate rows. The reason for this is that the original taxa.df data frame contains multiple taxa per sample; lake, station_id, and sample_number are associated with each taxon, and therefore are represented many times.
nondistinct.df <- taxa.df %>%
select(lake, station_id, sample_number)
DT::datatable(nondistinct.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))
distinct() will remove the duplicate rows present in the output above. This is a very simple call, which requires no input.
distinct.df <- taxa.df %>%
select(lake, station_id, sample_number) %>%
distinct()
DT::datatable(distinct.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))
In the example below, I use the select() and distinct() combination from the distinct section and then filter the data frame to only include rows where lake equals "caz". The distinct() call has ensured that each row is unique, but there is not a single column that could be used to uniquely identify a given sample. Using mutate() we can create a new column, called new_id, that concatenates the station_id and sample_number columns into a single value separated by an underscore (paste(station_id, sample_number, sep = "_")).
mutate.df <- taxa.df %>%
select(lake, station_id, sample_number) %>%
distinct() %>%
filter(lake == "caz") %>%
mutate(new_id = paste(station_id, sample_number,
sep = "_"))
DT::datatable(mutate.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:4))))
Maybe a second type of unique identifier is wanted for certain subsequent processes. mutate() allows for the creation of one or more columns within a single call. In the example below, new_id is created as it was in the example above but is followed by the creation of date_id. date_id is created by concatenating new_id and date columns. Therefore, a single mutate() call has created two new columns. Additionally, we can see that columns created downstream (date_id) can reference columns created upstream (new_id) within the mutate() call.
mutate.df <- taxa.df %>%
select(lake, station_id, sample_number, date) %>%
distinct() %>%
filter(lake == "caz") %>%
mutate(new_id = paste(station_id, sample_number,
sep = "_"),
date_id = paste(new_id, date,
sep = "_"))
DT::datatable(mutate.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:6))))
Calling group_by() will aggregate the data by the column(s) specified but alone will not alter the content of the data frame.
group.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, final_id, count) %>%
group_by(unique_id)
knitr::kable(group.df)
| unique_id | final_id | count |
|---|---|---|
| caz_1_1 | physa | 1 |
| caz_1_1 | gyraulus | 1 |
| caz_1_1 | bezzia | 5 |
| caz_1_1 | mallochohelea | 7 |
| caz_1_1 | chironomus | 1 |
| caz_1_1 | dicrotendipes | 3 |
| caz_1_1 | paratanytarsus | 34 |
| caz_1_1 | cricotopus | 1 |
| caz_1_1 | psectrocladius | 5 |
| caz_1_1 | thienemanniella | 4 |
| caz_1_1 | krenopelopia | 9 |
| caz_1_1 | procladius | 1 |
| caz_1_1 | hemerodromia | 1 |
| caz_1_1 | baetidae | 1 |
| caz_1_1 | caenis | 14 |
| caz_1_1 | acentria | 1 |
| caz_1_1 | enallagma | 1 |
| caz_1_1 | agraylea | 1 |
| caz_1_1 | oxyethira | 2 |
| caz_1_1 | hydroptilidae | 1 |
| caz_1_1 | gammarus | 2 |
| caz_1_1 | hyalella | 14 |
| caz_1_1 | caecidotea | 2 |
| caz_1_2 | amnicola_grana | 2 |
| caz_1_2 | physa | 27 |
| caz_1_2 | gyraulus_parvus | 1 |
| caz_1_2 | helisoma_anceps | 2 |
| caz_1_2 | promenetus_exacuous | 5 |
| caz_1_2 | valvata | 1 |
| caz_1_2 | bezzia | 9 |
| caz_1_2 | mallochohelea | 2 |
| caz_1_2 | chironomus | 5 |
| caz_1_2 | dicrotendipes | 4 |
| caz_1_2 | microtendipes | 1 |
| caz_1_2 | polypedilum | 2 |
| caz_1_2 | pseudochironomus | 1 |
| caz_1_2 | rheotanytarsus | 2 |
| caz_1_2 | thienemanniella | 1 |
| caz_1_2 | ablabesmyia | 5 |
| caz_1_2 | callibaetis | 1 |
| caz_1_2 | caenis | 1 |
| caz_1_2 | caenidae | 2 |
| caz_1_2 | acentria | 1 |
| caz_1_2 | mystacides | 2 |
| caz_1_2 | gammarus | 1 |
| caz_1_2 | hyalella | 13 |
| caz_1_2 | caecidotea | 9 |
The image below shows "under the hood" information about group.df from the Environment tab before and after the group_by() call is made. We do not need to focus on the details of this "under the hood" information, but it lets us see the change made by the group_by() call, unlike the table shown above, which does not provide any indication that the data has been aggregated.
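Grouping can also be checked directly in the console. A sketch, assuming group.df from the chunk above; group_by() converts the data frame to a grouped tibble, and dplyr's group_vars() reports the grouping column(s):

```r
library(dplyr)

# a grouped data frame gains the "grouped_df" class
class(group.df)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

# report which column(s) the data frame is currently grouped by
group_vars(group.df)
## [1] "unique_id"
```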
We can follow group_by() by a mutate() call to calculate a single value per group. Each group variable is replicated for each row within a group. In the example below, we are aggregating by two unique_ids (group_by(unique_id); “caz_1_1” and “caz_1_2”) and finding the total number of organisms identified within each group (mutate(total = sum(count))). The sum() function returns one number per group. mutate() then replicates this one number for all rows within a group.
group.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, final_id, count) %>%
group_by(unique_id) %>%
mutate(total = sum(count))
knitr::kable(group.df)
| unique_id | final_id | count | total |
|---|---|---|---|
| caz_1_1 | physa | 1 | 112 |
| caz_1_1 | gyraulus | 1 | 112 |
| caz_1_1 | bezzia | 5 | 112 |
| caz_1_1 | mallochohelea | 7 | 112 |
| caz_1_1 | chironomus | 1 | 112 |
| caz_1_1 | dicrotendipes | 3 | 112 |
| caz_1_1 | paratanytarsus | 34 | 112 |
| caz_1_1 | cricotopus | 1 | 112 |
| caz_1_1 | psectrocladius | 5 | 112 |
| caz_1_1 | thienemanniella | 4 | 112 |
| caz_1_1 | krenopelopia | 9 | 112 |
| caz_1_1 | procladius | 1 | 112 |
| caz_1_1 | hemerodromia | 1 | 112 |
| caz_1_1 | baetidae | 1 | 112 |
| caz_1_1 | caenis | 14 | 112 |
| caz_1_1 | acentria | 1 | 112 |
| caz_1_1 | enallagma | 1 | 112 |
| caz_1_1 | agraylea | 1 | 112 |
| caz_1_1 | oxyethira | 2 | 112 |
| caz_1_1 | hydroptilidae | 1 | 112 |
| caz_1_1 | gammarus | 2 | 112 |
| caz_1_1 | hyalella | 14 | 112 |
| caz_1_1 | caecidotea | 2 | 112 |
| caz_1_2 | amnicola_grana | 2 | 100 |
| caz_1_2 | physa | 27 | 100 |
| caz_1_2 | gyraulus_parvus | 1 | 100 |
| caz_1_2 | helisoma_anceps | 2 | 100 |
| caz_1_2 | promenetus_exacuous | 5 | 100 |
| caz_1_2 | valvata | 1 | 100 |
| caz_1_2 | bezzia | 9 | 100 |
| caz_1_2 | mallochohelea | 2 | 100 |
| caz_1_2 | chironomus | 5 | 100 |
| caz_1_2 | dicrotendipes | 4 | 100 |
| caz_1_2 | microtendipes | 1 | 100 |
| caz_1_2 | polypedilum | 2 | 100 |
| caz_1_2 | pseudochironomus | 1 | 100 |
| caz_1_2 | rheotanytarsus | 2 | 100 |
| caz_1_2 | thienemanniella | 1 | 100 |
| caz_1_2 | ablabesmyia | 5 | 100 |
| caz_1_2 | callibaetis | 1 | 100 |
| caz_1_2 | caenis | 1 | 100 |
| caz_1_2 | caenidae | 2 | 100 |
| caz_1_2 | acentria | 1 | 100 |
| caz_1_2 | mystacides | 2 | 100 |
| caz_1_2 | gammarus | 1 | 100 |
| caz_1_2 | hyalella | 13 | 100 |
| caz_1_2 | caecidotea | 9 | 100 |
This example can be taken one step further to calculate relative abundance (i.e., the percentage of the sample represented by each taxon) by adding percent = count / total * 100 to the mutate() call.
group.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, final_id, count) %>%
group_by(unique_id) %>%
mutate(total = sum(count),
percent = count / total * 100)
knitr::kable(group.df)
| unique_id | final_id | count | total | percent |
|---|---|---|---|---|
| caz_1_1 | physa | 1 | 112 | 0.8928571 |
| caz_1_1 | gyraulus | 1 | 112 | 0.8928571 |
| caz_1_1 | bezzia | 5 | 112 | 4.4642857 |
| caz_1_1 | mallochohelea | 7 | 112 | 6.2500000 |
| caz_1_1 | chironomus | 1 | 112 | 0.8928571 |
| caz_1_1 | dicrotendipes | 3 | 112 | 2.6785714 |
| caz_1_1 | paratanytarsus | 34 | 112 | 30.3571429 |
| caz_1_1 | cricotopus | 1 | 112 | 0.8928571 |
| caz_1_1 | psectrocladius | 5 | 112 | 4.4642857 |
| caz_1_1 | thienemanniella | 4 | 112 | 3.5714286 |
| caz_1_1 | krenopelopia | 9 | 112 | 8.0357143 |
| caz_1_1 | procladius | 1 | 112 | 0.8928571 |
| caz_1_1 | hemerodromia | 1 | 112 | 0.8928571 |
| caz_1_1 | baetidae | 1 | 112 | 0.8928571 |
| caz_1_1 | caenis | 14 | 112 | 12.5000000 |
| caz_1_1 | acentria | 1 | 112 | 0.8928571 |
| caz_1_1 | enallagma | 1 | 112 | 0.8928571 |
| caz_1_1 | agraylea | 1 | 112 | 0.8928571 |
| caz_1_1 | oxyethira | 2 | 112 | 1.7857143 |
| caz_1_1 | hydroptilidae | 1 | 112 | 0.8928571 |
| caz_1_1 | gammarus | 2 | 112 | 1.7857143 |
| caz_1_1 | hyalella | 14 | 112 | 12.5000000 |
| caz_1_1 | caecidotea | 2 | 112 | 1.7857143 |
| caz_1_2 | amnicola_grana | 2 | 100 | 2.0000000 |
| caz_1_2 | physa | 27 | 100 | 27.0000000 |
| caz_1_2 | gyraulus_parvus | 1 | 100 | 1.0000000 |
| caz_1_2 | helisoma_anceps | 2 | 100 | 2.0000000 |
| caz_1_2 | promenetus_exacuous | 5 | 100 | 5.0000000 |
| caz_1_2 | valvata | 1 | 100 | 1.0000000 |
| caz_1_2 | bezzia | 9 | 100 | 9.0000000 |
| caz_1_2 | mallochohelea | 2 | 100 | 2.0000000 |
| caz_1_2 | chironomus | 5 | 100 | 5.0000000 |
| caz_1_2 | dicrotendipes | 4 | 100 | 4.0000000 |
| caz_1_2 | microtendipes | 1 | 100 | 1.0000000 |
| caz_1_2 | polypedilum | 2 | 100 | 2.0000000 |
| caz_1_2 | pseudochironomus | 1 | 100 | 1.0000000 |
| caz_1_2 | rheotanytarsus | 2 | 100 | 2.0000000 |
| caz_1_2 | thienemanniella | 1 | 100 | 1.0000000 |
| caz_1_2 | ablabesmyia | 5 | 100 | 5.0000000 |
| caz_1_2 | callibaetis | 1 | 100 | 1.0000000 |
| caz_1_2 | caenis | 1 | 100 | 1.0000000 |
| caz_1_2 | caenidae | 2 | 100 | 2.0000000 |
| caz_1_2 | acentria | 1 | 100 | 1.0000000 |
| caz_1_2 | mystacides | 2 | 100 | 2.0000000 |
| caz_1_2 | gammarus | 1 | 100 | 1.0000000 |
| caz_1_2 | hyalella | 13 | 100 | 13.0000000 |
| caz_1_2 | caecidotea | 9 | 100 | 9.0000000 |
Once the calculations that required aggregation (i.e., group_by()) are complete, make sure to remove the aggregation from the data frame using ungroup(). If you forget to do this, R will continue to apply the aggregation to future calculations, which will most likely be inappropriate. Generally, you will eventually get an error message if you forget to use ungroup().
ungroup.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, final_id, count) %>%
group_by(unique_id) %>%
mutate(total = sum(count),
percent = count / total * 100) %>%
ungroup()
knitr::kable(ungroup.df)
| unique_id | final_id | count | total | percent |
|---|---|---|---|---|
| caz_1_1 | physa | 1 | 112 | 0.8928571 |
| caz_1_1 | gyraulus | 1 | 112 | 0.8928571 |
| caz_1_1 | bezzia | 5 | 112 | 4.4642857 |
| caz_1_1 | mallochohelea | 7 | 112 | 6.2500000 |
| caz_1_1 | chironomus | 1 | 112 | 0.8928571 |
| caz_1_1 | dicrotendipes | 3 | 112 | 2.6785714 |
| caz_1_1 | paratanytarsus | 34 | 112 | 30.3571429 |
| caz_1_1 | cricotopus | 1 | 112 | 0.8928571 |
| caz_1_1 | psectrocladius | 5 | 112 | 4.4642857 |
| caz_1_1 | thienemanniella | 4 | 112 | 3.5714286 |
| caz_1_1 | krenopelopia | 9 | 112 | 8.0357143 |
| caz_1_1 | procladius | 1 | 112 | 0.8928571 |
| caz_1_1 | hemerodromia | 1 | 112 | 0.8928571 |
| caz_1_1 | baetidae | 1 | 112 | 0.8928571 |
| caz_1_1 | caenis | 14 | 112 | 12.5000000 |
| caz_1_1 | acentria | 1 | 112 | 0.8928571 |
| caz_1_1 | enallagma | 1 | 112 | 0.8928571 |
| caz_1_1 | agraylea | 1 | 112 | 0.8928571 |
| caz_1_1 | oxyethira | 2 | 112 | 1.7857143 |
| caz_1_1 | hydroptilidae | 1 | 112 | 0.8928571 |
| caz_1_1 | gammarus | 2 | 112 | 1.7857143 |
| caz_1_1 | hyalella | 14 | 112 | 12.5000000 |
| caz_1_1 | caecidotea | 2 | 112 | 1.7857143 |
| caz_1_2 | amnicola_grana | 2 | 100 | 2.0000000 |
| caz_1_2 | physa | 27 | 100 | 27.0000000 |
| caz_1_2 | gyraulus_parvus | 1 | 100 | 1.0000000 |
| caz_1_2 | helisoma_anceps | 2 | 100 | 2.0000000 |
| caz_1_2 | promenetus_exacuous | 5 | 100 | 5.0000000 |
| caz_1_2 | valvata | 1 | 100 | 1.0000000 |
| caz_1_2 | bezzia | 9 | 100 | 9.0000000 |
| caz_1_2 | mallochohelea | 2 | 100 | 2.0000000 |
| caz_1_2 | chironomus | 5 | 100 | 5.0000000 |
| caz_1_2 | dicrotendipes | 4 | 100 | 4.0000000 |
| caz_1_2 | microtendipes | 1 | 100 | 1.0000000 |
| caz_1_2 | polypedilum | 2 | 100 | 2.0000000 |
| caz_1_2 | pseudochironomus | 1 | 100 | 1.0000000 |
| caz_1_2 | rheotanytarsus | 2 | 100 | 2.0000000 |
| caz_1_2 | thienemanniella | 1 | 100 | 1.0000000 |
| caz_1_2 | ablabesmyia | 5 | 100 | 5.0000000 |
| caz_1_2 | callibaetis | 1 | 100 | 1.0000000 |
| caz_1_2 | caenis | 1 | 100 | 1.0000000 |
| caz_1_2 | caenidae | 2 | 100 | 2.0000000 |
| caz_1_2 | acentria | 1 | 100 | 1.0000000 |
| caz_1_2 | mystacides | 2 | 100 | 2.0000000 |
| caz_1_2 | gammarus | 1 | 100 | 1.0000000 |
| caz_1_2 | hyalella | 13 | 100 | 13.0000000 |
| caz_1_2 | caecidotea | 9 | 100 | 9.0000000 |
The image below shows "under the hood" information about ungroup.df from the Environment tab before and after the ungroup() call is made. We do not need to focus on the details of this "under the hood" information, but it lets us see the change made by the ungroup() call, unlike the table shown above, which does not provide any indication that the data has been unaggregated.
summarize() collapses each group down to a single row, retaining only the column(s) specified in group_by() and the column(s) created in the summarize() call. This function requires group_by() to be called in advance.
In the group_by example, we calculated the percentage of the sample comprised by each final_id taxon. final_id represents the highest taxonomic resolution to which these taxa were identified; therefore, each row represented a unique taxon.
What if we want to explore these samples at a lower resolution? In the example below, the taxa are represented at the order-level. The percent calculations are correct but, in many cases, the same taxon is represented more than once. This is a poor representation of the data. summarize() can help us solve this issue in the subsequent code chunks.
sub.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, order, count) %>%
group_by(unique_id) %>%
mutate(total = sum(count),
percent = count / total * 100) %>%
ungroup()
knitr::kable(sub.df)
| unique_id | order | count | total | percent |
|---|---|---|---|---|
| caz_1_1 | basommatophora | 1 | 112 | 0.8928571 |
| caz_1_1 | basommatophora | 1 | 112 | 0.8928571 |
| caz_1_1 | diptera | 5 | 112 | 4.4642857 |
| caz_1_1 | diptera | 7 | 112 | 6.2500000 |
| caz_1_1 | diptera | 1 | 112 | 0.8928571 |
| caz_1_1 | diptera | 3 | 112 | 2.6785714 |
| caz_1_1 | diptera | 34 | 112 | 30.3571429 |
| caz_1_1 | diptera | 1 | 112 | 0.8928571 |
| caz_1_1 | diptera | 5 | 112 | 4.4642857 |
| caz_1_1 | diptera | 4 | 112 | 3.5714286 |
| caz_1_1 | diptera | 9 | 112 | 8.0357143 |
| caz_1_1 | diptera | 1 | 112 | 0.8928571 |
| caz_1_1 | diptera | 1 | 112 | 0.8928571 |
| caz_1_1 | ephemeroptera | 1 | 112 | 0.8928571 |
| caz_1_1 | ephemeroptera | 14 | 112 | 12.5000000 |
| caz_1_1 | lepidoptera | 1 | 112 | 0.8928571 |
| caz_1_1 | odonata | 1 | 112 | 0.8928571 |
| caz_1_1 | trichoptera | 1 | 112 | 0.8928571 |
| caz_1_1 | trichoptera | 2 | 112 | 1.7857143 |
| caz_1_1 | trichoptera | 1 | 112 | 0.8928571 |
| caz_1_1 | amphipoda | 2 | 112 | 1.7857143 |
| caz_1_1 | amphipoda | 14 | 112 | 12.5000000 |
| caz_1_1 | isopoda | 2 | 112 | 1.7857143 |
| caz_1_2 | neotaenioglossa | 2 | 100 | 2.0000000 |
| caz_1_2 | basommatophora | 27 | 100 | 27.0000000 |
| caz_1_2 | basommatophora | 1 | 100 | 1.0000000 |
| caz_1_2 | basommatophora | 2 | 100 | 2.0000000 |
| caz_1_2 | basommatophora | 5 | 100 | 5.0000000 |
| caz_1_2 | heterostropha | 1 | 100 | 1.0000000 |
| caz_1_2 | diptera | 9 | 100 | 9.0000000 |
| caz_1_2 | diptera | 2 | 100 | 2.0000000 |
| caz_1_2 | diptera | 5 | 100 | 5.0000000 |
| caz_1_2 | diptera | 4 | 100 | 4.0000000 |
| caz_1_2 | diptera | 1 | 100 | 1.0000000 |
| caz_1_2 | diptera | 2 | 100 | 2.0000000 |
| caz_1_2 | diptera | 1 | 100 | 1.0000000 |
| caz_1_2 | diptera | 2 | 100 | 2.0000000 |
| caz_1_2 | diptera | 1 | 100 | 1.0000000 |
| caz_1_2 | diptera | 5 | 100 | 5.0000000 |
| caz_1_2 | ephemeroptera | 1 | 100 | 1.0000000 |
| caz_1_2 | ephemeroptera | 1 | 100 | 1.0000000 |
| caz_1_2 | ephemeroptera | 2 | 100 | 2.0000000 |
| caz_1_2 | lepidoptera | 1 | 100 | 1.0000000 |
| caz_1_2 | trichoptera | 2 | 100 | 2.0000000 |
| caz_1_2 | amphipoda | 1 | 100 | 1.0000000 |
| caz_1_2 | amphipoda | 13 | 100 | 13.0000000 |
| caz_1_2 | isopoda | 9 | 100 | 9.0000000 |
The most intuitive way to solve this problem is to sum the counts by the unique_id and order columns before calculating the percentages, which can be done with a combination of group_by() and summarize():
1. Apply group_by() to the unique_id and order columns: group_by(unique_id, order)
2. Apply summarize() in the same way mutate() was applied in the mutate section. In this case, count will be overwritten by the sum of the counts aggregated by the unique_id and order columns: summarize(count = sum(count))
3. Apply ungroup() to remove the groupings created with group_by().
In the table output, we can see that each taxon, order, is only represented once per sample, unique_id.
summarize.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, order, count) %>%
group_by(unique_id, order) %>%
summarize(count = sum(count)) %>%
ungroup()
knitr::kable(summarize.df)
| unique_id | order | count |
|---|---|---|
| caz_1_1 | amphipoda | 16 |
| caz_1_1 | basommatophora | 2 |
| caz_1_1 | diptera | 71 |
| caz_1_1 | ephemeroptera | 15 |
| caz_1_1 | isopoda | 2 |
| caz_1_1 | lepidoptera | 1 |
| caz_1_1 | odonata | 1 |
| caz_1_1 | trichoptera | 4 |
| caz_1_2 | amphipoda | 14 |
| caz_1_2 | basommatophora | 35 |
| caz_1_2 | diptera | 32 |
| caz_1_2 | ephemeroptera | 4 |
| caz_1_2 | heterostropha | 1 |
| caz_1_2 | isopoda | 9 |
| caz_1_2 | lepidoptera | 1 |
| caz_1_2 | neotaenioglossa | 2 |
| caz_1_2 | trichoptera | 2 |
To calculate the percentage of each order, apply the group_by(), mutate(), and ungroup() combination from the group_by section.
summarize.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, order, count) %>%
group_by(unique_id, order) %>%
summarize(count = sum(count)) %>%
ungroup() %>%
group_by(unique_id) %>%
mutate(total = sum(count),
percent = count / total * 100) %>%
ungroup()
knitr::kable(summarize.df)
| unique_id | order | count | total | percent |
|---|---|---|---|---|
| caz_1_1 | amphipoda | 16 | 112 | 14.2857143 |
| caz_1_1 | basommatophora | 2 | 112 | 1.7857143 |
| caz_1_1 | diptera | 71 | 112 | 63.3928571 |
| caz_1_1 | ephemeroptera | 15 | 112 | 13.3928571 |
| caz_1_1 | isopoda | 2 | 112 | 1.7857143 |
| caz_1_1 | lepidoptera | 1 | 112 | 0.8928571 |
| caz_1_1 | odonata | 1 | 112 | 0.8928571 |
| caz_1_1 | trichoptera | 4 | 112 | 3.5714286 |
| caz_1_2 | amphipoda | 14 | 100 | 14.0000000 |
| caz_1_2 | basommatophora | 35 | 100 | 35.0000000 |
| caz_1_2 | diptera | 32 | 100 | 32.0000000 |
| caz_1_2 | ephemeroptera | 4 | 100 | 4.0000000 |
| caz_1_2 | heterostropha | 1 | 100 | 1.0000000 |
| caz_1_2 | isopoda | 9 | 100 | 9.0000000 |
| caz_1_2 | lepidoptera | 1 | 100 | 1.0000000 |
| caz_1_2 | neotaenioglossa | 2 | 100 | 2.0000000 |
| caz_1_2 | trichoptera | 2 | 100 | 2.0000000 |
The previous code chunk works, but it is getting very long. Long code chunks are harder for your future self or others to interpret and can make it harder to find and troubleshoot bugs. Better practice would be to break the previous code chunk into 2-3 sub-tasks.
The following code chunk is broken into 3 sub-tasks, which should make it easier to follow the process and allow us to observe the data after each sub-task is executed.
Explanation of sub-tasks below:
caz1.df: reduce the data frame down to only the rows and columns of interest. In this case, we are only interested in the two replicates from Cazenovia Lake site 1; therefore, I named the data frame caz1.df to describe the data represented within the data frame.
order.df: summarize the taxonomic counts by the sample, unique_id, and taxonomic rank, order.
percent.df: calculate the percentage of each sample, unique_id, represented by each taxon, order.
caz1.df <- taxa.df %>%
filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>%
select(unique_id, order, count)
order.df <- caz1.df %>%
group_by(unique_id, order) %>%
summarize(count = sum(count)) %>%
ungroup()
percent.df <- order.df %>%
group_by(unique_id) %>%
mutate(total = sum(count),
percent = count / total * 100) %>%
ungroup()
Link: https://dplyr.tidyverse.org/reference/join.html
left_join, right_join, full_join, inner_join, semi_join, anti_join
left.df <- left_join(taxa.df, env.df, by = "unique_id")
DT::datatable(left.df, options = list(scrollX = TRUE))
right.df <- right_join(taxa.df, env.df, by = "unique_id")
DT::datatable(right.df, options = list(scrollX = TRUE))
full.df <- taxa.df %>%
filter(lake == "onon") %>%
full_join(env.df, by = "unique_id")
DT::datatable(full.df, options = list(scrollX = TRUE))
semi.df <- taxa.df %>%
filter(lake == "onon") %>%
semi_join(env.df, by = "unique_id")
DT::datatable(semi.df, options = list(scrollX = TRUE))
sub.df <- taxa.df %>%
filter(lake == "onon")
anti.df <- anti_join(env.df, sub.df, by = "unique_id")
DT::datatable(anti.df, options = list(scrollX = TRUE))
nest.df <- taxa.df %>%
filter(lake == "onon") %>%
nest_join(env.df, by = "unique_id")
DT::datatable(nest.df, options = list(scrollX = TRUE))
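inner_join() is not demonstrated above. A minimal sketch, following the same pattern as the other joins: it keeps only the rows where unique_id is present in both taxa.df and env.df.

```r
library(dplyr)

# retain only rows with a unique_id match in BOTH data frames
inner.df <- inner_join(taxa.df, env.df, by = "unique_id")
DT::datatable(inner.df, options = list(scrollX = TRUE))
```

Unlike left_join() or full_join(), an inner join never produces NAs from unmatched keys, because unmatched rows are dropped entirely.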
Link: https://dplyr.tidyverse.org/reference/bind.html
onon.df <- taxa.df %>%
filter(lake == "onon")
otis.df <- taxa.df %>%
filter(lake == "ot")
caz.df <- taxa.df %>%
filter(lake == "caz")
bind.df <- bind_rows(onon.df, otis.df)
DT::datatable(bind.df, options = list(scrollX = TRUE))
hier.df <- onon.df %>%
select(phylum:species)
DT::datatable(hier.df, options = list(scrollX = TRUE))
count.df <- onon.df %>%
select(unique_id, final_id, count)
DT::datatable(count.df, options = list(scrollX = TRUE))
bind.df <- bind_cols(count.df, hier.df)
DT::datatable(bind.df, options = list(scrollX = TRUE))
library(tidyr)
suppressPackageStartupMessages(
library(dplyr)
)
Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.
taxa.df <- file.path("data",
"zms_thesis-macro_2017-06-18.csv") %>%
read.csv(stringsAsFactors = FALSE)
Preprocess taxa.df to just represent order-level taxonomic counts per sample. For more details about this process see the summarize section.
ord.df <- taxa.df %>%
select(unique_id, order, count) %>%
group_by(unique_id, order) %>%
summarize(count = sum(count)) %>%
ungroup()
DT::datatable(ord.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))
In some instances it might be beneficial to transpose a data frame from a long format to a wide format, which can be done simply with spread(). spread() requires two columns to be specified:
1. key: the column whose values will become the new column headers (order in the example below)
2. value: the column whose values will fill the cells under those new headers (count in the example below)
The remaining columns will be used as an anchor point to represent a unique key for each row. If these remaining columns are not unique, an error will be returned. You will need to figure out why there are duplicate rows in the unique identifier columns and how to remedy the issue.
In the example below, ord.df is transformed from a long to a wide format. order values now represent column headers and the values from the count column have been filled in appropriately under the associated new order column headers.
wide.df <- ord.df %>%
spread(order, count)
DT::datatable(wide.df, options = list(scrollX = TRUE))
In the example above, any instance where a taxon was not found in a sample is represented as an NA (shown as a blank space in the DT::datatable() output). However, in this example it would be more appropriate to represent all of these values as zero. To do this we just need to specify fill = 0 within the spread() call. In the table below, there are no NAs (shown as blank spaces in the DT::datatable() output).
wide.df <- ord.df %>%
spread(order, count, fill = 0)
DT::datatable(wide.df, options = list(scrollX = TRUE))
In most instances, packages from the tidyverse are designed to operate on data in a long data format. gather() makes it simple to convert a wide format to a long format.
In this code chunk, wide.df from the spread section will be converted from a wide to a long data format using gather(). order represents a new column name, which by default will contain all of the column names from wide.df. count also represents a new column name, which will contain all of the values that were present in each column. If just gather(order, count) is applied, unique_id is included within the order column and the values from the unique_id column are included within the count column. This is not correct. The next code chunk will correct this issue.
long.df <- wide.df %>%
gather(order, count)
DT::datatable(long.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:2))))
Adding -unique_id to the end of the gather() call will retain unique_id in the final output but exclude this column from being included in the conversion of the remaining headers to the order column and the remaining row values to the count column. One or more columns can be excluded following the -unique_id example.
long.df <- wide.df %>%
gather(order, count, -unique_id)
DT::datatable(long.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))
In some instances, a data frame may contain a column that has concatenated information that is separated by a common character or pattern. It may be beneficial to extract this concatenated information into separate columns to make it easier to perform subsequent tasks, such as filtering. separate() can be used to make this a simple task.
This example uses long.df from the gather section. In the separate() call:
- unique_id refers to the name of the column to be split
- c("lake", "station_id", "replicate") refers to the new column names that the split values from unique_id will fill
- sep = "_" specifies that the strings in unique_id should be split by underscores

The default of separate() is to remove (remove = TRUE) the original column unique_id from the returned data frame.
sep.df <- long.df %>%
separate(unique_id, c("lake", "station_id", "replicate"), sep = "_")
DT::datatable(sep.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:5))))
I often find it helpful to specify remove = FALSE, which, in this example, will split unique_id into several new columns but also retain unique_id in the final output.
sep.df <- long.df %>%
separate(unique_id,
c("lake", "station_id", "replicate"),
sep = "_",
remove = FALSE)
DT::datatable(sep.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:6))))
unite() is the opposite of separate(); it is used to combine values from multiple columns into a single string separated by a common character or pattern.
Using sep.df, created in the separate section, columns lake, station_id, and replicate can be concatenated into a single string within a new column using unite().
In the unite() call:
"unique_id2" refers to the new column name that will contain the concatenated strings from the subsequently specified columnsc("lake", "station_id", "replicate") refers to the column names that contain the values that we want to be concatenated in unique_id2.sep = "_" specifies that the strings from c("lake", "station_id", "replicate") should be denoted in unique_id2 by an underscoreThe default of unite() is to remove (remove = TRUE) the concatenated columns (c("lake", "station_id", "replicate")) from the returned data frame.
unite.df <- sep.df %>%
unite("unique_id2",
c("lake", "station_id", "replicate"),
sep = "-")
DT::datatable(unite.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:4))))
In some cases I find it useful to specify remove = FALSE, which, in this example, will retain lake, station_id, and replicate in the final output.
unite.df <- sep.df %>%
unite("unique_id2",
c("lake", "station_id", "replicate"),
sep = "-",
remove = FALSE)
DT::datatable(unite.df,
options = list(columnDefs = list(list(className = 'dt-center', targets = 0:7))))
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
suppressPackageStartupMessages(
library(dplyr)
)
Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.
taxa.df <- file.path("data",
"zms_thesis-macro_2017-06-18.csv") %>%
read.csv(stringsAsFactors = FALSE)
Preprocess taxa.df to only include unique instances of station IDs and sample dates. For more details about this process see the select and distinct sections.
dates.df <- taxa.df %>%
select(station_id, date) %>%
distinct()
DT::datatable(dates.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:2))))
In dates.df, the date column is imported as a character class and follows a “mm/dd/yyyy” format. The function mdy() can be used to convert the character strings in the date column to a date class.
mdy.df <- dates.df %>%
mutate(date = mdy(date))
DT::datatable(mdy.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:2))))
In the example above, it is obvious that the format of the date has changed, but it is not obvious that the R class has changed. First, look at the classes represented in dates.df.
sapply(dates.df, class)
## station_id date
## "character" "character"
Then, looking at the column classes in mdy.df, we can see date has changed to class “Date”.
sapply(mdy.df, class)
## station_id date
## "character" "Date"
Once a column is a date or datetime class, lubridate functions make it easy to extract parts of the date, such as year, month, day, hour, minute, and second. In the mutate() call below, I applied many, but not all, of the helpful functions for extracting datetime-related information. The majority of these are straightforward; however, the label and abbr arguments can be changed to alter the output of functions like month() and wday().
label:
- label = FALSE returns a numeric value
- label = TRUE returns a character value

abbr:
- if label = FALSE, then abbr has no effect
- label = TRUE and abbr = TRUE returns an abbreviated character string
- label = TRUE and abbr = FALSE returns a full character string
extract.df <- mdy.df %>%
mutate(year = year(date),
month_int = month(date),
month_abv = month(date, label = TRUE),
month_full = month(date, label = TRUE, abbr = FALSE),
week = week(date),
day = day(date),
wday_int = wday(date),
wday_abv = wday(date, label = TRUE),
wday_full = wday(date, label = TRUE, abbr = FALSE),
mday = mday(date),
qday = qday(date),
yday = yday(date),
hour = hour(date),
minute = minute(date),
second = second(date))
DT::datatable(extract.df, options = list(scrollX = TRUE))
round_date() will round the date or datetime to the specified unit of time, such as “15 minutes”, “week”, “month”, or “year”. I find it really convenient that you can round to the nearest “15 minutes”. floor_date() and ceiling_date() provide similar functionality but always round down or up, respectively.
round.df <- mdy.df %>%
mutate(round_week = round_date(date, "week"),
round_month = round_date(date, "month"),
round_year = round_date(date, "year"),
round_year5 = round_date(date, "5 years"),
round_century = round_date(date, "100 years"),
floor_month = floor_date(date, "month"),
floor_year = floor_date(date, "year"),
ceiling_month = ceiling_date(date, "month"),
ceiling_year = ceiling_date(date, "year"))
DT::datatable(round.df, options = list(scrollX = TRUE,
autoWidth = TRUE,
columnDefs = list(list(width = '70px', targets = c(2)))))
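The “15 minutes” rounding mentioned above applies to datetimes rather than dates; a quick sketch with a hypothetical timestamp:

```r
library(lubridate)

# Hypothetical timestamp; ymd_hms() parses the string as a datetime.
x <- ymd_hms("2017-06-18 10:37:00")

round_date(x, "15 minutes")   # rounds to the nearest quarter hour: 10:30:00
floor_date(x, "15 minutes")   # always rounds down: 10:30:00
ceiling_date(x, "15 minutes") # always rounds up: 10:45:00
```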
library(ggplot2)
suppressPackageStartupMessages(
library(dplyr)
)
library(tidyr)
Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes.
taxa.df <- file.path("data",
"zms_thesis-macro_2017-06-18.csv") %>%
read.csv(stringsAsFactors = FALSE)
Preprocess taxa.df to just represent order-level taxonomic counts per sample. For more details about this process see the summarize section.
ord.df <- taxa.df %>%
select(unique_id, station_id, lake, order, count) %>%
group_by(unique_id, station_id, lake, order) %>%
summarize(count = sum(count)) %>%
ungroup() %>%
group_by(station_id, lake, order) %>%
summarize(count = mean(count)) %>%
ungroup()
DT::datatable(ord.df, options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))
Calculate the relative abundance (i.e., the percentage of the sample represented by each taxon) of each taxon within a sample. For more details about this process see the group_by section.
pct.df <- ord.df %>%
group_by(station_id) %>%
mutate(total = sum(count),
percent = count / total * 100) %>%
ungroup() %>%
tidyr::complete(order,
nesting(station_id, lake, total),
fill = list(count = 0, percent = 0)) %>%
mutate(lake = factor(lake, levels = c("onon", "ot", "caz")))
Import environmental data associated with the taxonomic counts.
env.df <- file.path("data",
"zms_thesis-env_2017-06-18.csv") %>%
read.csv(stringsAsFactors = FALSE) %>%
mutate(station_id = stringr::str_sub(unique_id, 1, -3)) %>%
select(-unique_id) %>%
distinct()
DT::datatable(env.df, options = list(scrollX = TRUE))
Generally, you want to start with a ggplot() call. Here we can see a blank figure template returned when ggplot() is called and pct.df is specified as data. The pipe operator can be used to pipe data into ggplot() but it cannot be used to chain subsequent ggplot2 functions together.
ggplot(pct.df)
pct.df %>%
ggplot()
To start to format the figure, you will want to use aes(). When using aes() within ggplot(), you are specifying global variables. For instance, you will generally specify the columns, from the specified data frame (e.g., pct.df), that will represent your x-axis and y-axis. In the example below, the lake column represents the x-axis, while the percent column represents the y-axis. Now we have a template for our figure, but we will want to add layers (e.g., boxplot, scatter plot, bar plot) to this template to visualize our data.
pct.df %>%
ggplot(aes(x = lake, y = percent))
### Adding Layers (+)
ggplot2 does not recognize the pipe operator (%>%), instead it uses + to add components to a plot.
Now let’s add data to the plot by adding a + at the end of ggplot() and then specifying the plot type we are interested in creating. Plot-layer functions in ggplot2 generally start with the prefix “geom_”.
Here we will create a box and whisker plot using geom_boxplot(). For this example, we will only focus on the percentage of Ephemeroptera (Mayflies) found within each lake by using the filter() function before the ggplot() function. The aesthetics have been specified within ggplot(), therefore geom_boxplot() can be called without any arguments.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent)) +
geom_boxplot()
The color of the interquartile boxes can be specified with fill. In this example, fill = lake; ggplot2 automatically recognizes that lake represents three categories and uses a default color palette to color the plot. fill is specified within aes() within the ggplot() call, and therefore this specification carries through to the subsequent geom_boxplot() call.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent, fill = lake)) +
geom_boxplot()
Color can be specified in a similar manner to fill. In the plot below, we can see the difference between color and fill (the previous plot).
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent, color = lake)) +
geom_boxplot()
When I first started using ggplot2, I would frequently try to put the color or other aesthetic arguments outside of the aes() call. For example, in ggplot(aes(x = lake, y = percent), color = lake), color = lake is not within the aes() call. Therefore, it has no effect on the plot (i.e., there is no color added to the plot).
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent), color = lake) +
geom_boxplot()
aes() and color or other aesthetics can also be specified within the geom_boxplot() call. This allows you to be more specific about which feature you want to receive a given aesthetic.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent)) +
geom_boxplot(aes(color = lake))
Additionally, if you want to specify colors not based on features from your data frame (e.g., lake), then you can specify these colors within the geom_boxplot() call. Note that color is NOT specified within an aes() call here.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent)) +
geom_boxplot(color = c("purple", "orange", "brown"))
If you do attempt to wrap manually specified colors, like color = c("purple", "orange", "brown"), within aes(), you will get an error.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent)) +
geom_boxplot(aes(color = c("purple", "orange", "brown")))
## Error: Aesthetics must be either length 1 or the same as the data (48): colour
Also, trying to manually specify colors within the ggplot() call will have no effect on the plot.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent), color = c("purple", "orange", "brown")) +
geom_boxplot()
Let’s add another layer to the plot above. geom_point() can be used to vizualize where the data points actually fall relative to the box and whisker plots.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent)) +
geom_boxplot() +
geom_point()
In the geom_point() plot above, it is not possible to know whether a point represents a single sample or whether multiple samples with the same value are overlaid on one another. geom_jitter() can be used to randomly offset the points from their actual positions to avoid the overlap issue present in geom_point().
In this example, geom_jitter() is used to offset the points overlaid on the box and whisker plots. The points are colored by lake within this call: geom_jitter(aes(color = lake)). Additionally, the outliers plotted by geom_boxplot() are hidden with geom_boxplot(outlier.shape = NA). These outliers are represented in the geom_jitter() call; therefore, if the outliers were not removed from geom_boxplot(), these points would be plotted twice.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = lake))
Here we can see what happens if color is specified within the ggplot() call. Now the box and whisker lines and the points are colored by lake. This is in contrast to the last plot, where only the points were colored by lake.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent, color = lake)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
geom_violin() displays the distribution of the data. This plot helps us to see that there are a lot of samples from Onondaga Lake (“onon”) that have no Ephemeroptera (percent = 0).
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent, fill = lake)) +
geom_violin()
Similar to violin plots, geom_dotplot() can be more informative than a box and whisker plot.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent, fill = lake)) +
geom_dotplot(binaxis = "y",
stackdir = "center",
binwidth = 0.25)
geom_dotplot() can also be overlaid on geom_boxplot(), which may be a better alternative than the geom_jitter() example above.
pct.df %>%
filter(order == "ephemeroptera") %>%
ggplot(aes(x = lake, y = percent)) +
geom_boxplot(outlier.shape = NA) +
geom_dotplot(aes(fill = lake),
binaxis = "y",
stackdir = "center",
binwidth = 0.25)
For this section, the data will need to be summarized by lake. The data is aggregated by lake and order, and the median abundance (count) is calculated. The five most abundant taxa are retained as is, while the remaining taxa are lumped together and reclassified as “Other” (forcats::fct_lump(factor(order), n = 5, w = count)). The taxa are then sorted in descending order of abundance (forcats::fct_reorder(order, count, median, .desc = TRUE)).
abund.df <- pct.df %>%
group_by(lake, order) %>%
summarise(count = median(count)) %>%
ungroup() %>%
mutate(order = forcats::fct_lump(factor(order),
n = 5,
w = count),
order = forcats::fct_reorder(order, count, median, .desc = TRUE),
lake = factor(lake, c("onon", "ot", "caz")))
DT::datatable(abund.df, options = list(scrollX = TRUE))
Using geom_bar(), we can generate a bar plot. stat = "identity" must be used when a y value is specified in aes(); otherwise, ggplot2 will simply count the number of x-value instances to create a y-axis count.
WARNING: This plot is a quick example of geom_bar() and is not a good representation of the data. In this case, the median abundance was calculated for each taxon within each lake. geom_bar() is summing the median values per taxon per lake to produce this figure.
abund.df %>%
ggplot(aes(x = order, y = count)) +
geom_bar(stat = "identity")
As the warning above notes, the above plot is not a great representation of the data because it essentially sums the three median abundance values (one per lake) for each taxon to produce the bars. We can see this breakdown by adding fill = lake. This is also a poor way of viewing these data.
abund.df %>%
ggplot(aes(x = order, y = count, fill = lake)) +
geom_bar(stat = "identity")
position = "dodge" can be added to the geom_bar() call to make the bars for each lake appear side-by-side.
abund.df %>%
ggplot(aes(x = order, y = count, fill = lake)) +
geom_bar(stat = "identity",
position = "dodge")
Another great way to visualize these data is to create stacked bar plots. Stacked bar plots are almost always a better alternative to pie charts. In general, stacked bar plots provide a more straightforward representation of quantity, which makes it easier to compare multiple stacked bar plots than multiple pie charts.
Stacked bar plots can be created by adding position = "stack" to the geom_bar() call. The x-axis represents lake, the y-axis represents count, and the fill represents order. This information is exactly the same as the geom_bar() plot above, where position = "dodge", but it is more condensed.
abund.df %>%
mutate(order = forcats::fct_reorder(order, count, median)) %>%
ggplot(aes(x = lake, y = count, fill = order)) +
geom_bar(stat = "identity",
position = "stack")
Stacked bar plots can also be created by adding position = "fill" to the geom_bar() call. This effectively calculates relative abundance, the percentage each taxon represents within a lake. This normalizes the data and makes it easier to compare between lakes.
abund.df %>%
mutate(order = forcats::fct_reorder(order, count, median)) %>%
ggplot(aes(x = lake, y = count, fill = order)) +
geom_bar(stat = "identity",
position = "fill")
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
ggplot(aes(x = diptera, y = amphipoda)) +
geom_point()
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
ggplot(aes(x = diptera, y = amphipoda)) +
geom_point(aes(color = lake))
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
ggplot(aes(x = diptera, y = amphipoda)) +
geom_point(aes(color = lake)) +
geom_line()
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
ggplot(aes(x = diptera, y = amphipoda)) +
geom_point(aes(color = lake)) +
geom_smooth(method = "lm", formula = y ~ x)
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
ggplot(aes(x = diptera, y = amphipoda, color = lake)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x)
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
ggplot(aes(x = diptera, y = amphipoda)) +
geom_point(aes(color = lake)) +
geom_smooth(method = "loess", formula = y ~ x)
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
ggplot(aes(x = diptera, y = amphipoda, color = lake, fill = lake)) +
geom_point(aes(color = lake)) +
geom_smooth(method = "loess", formula = y ~ x)
pct.df %>%
select(station_id, lake, percent, order) %>%
spread(order, percent) %>%
gather(order, count, amphipoda:veneroida, -diptera) %>%
ggplot(aes(x = diptera, y = count, color = lake, fill = lake)) +
geom_point(aes(color = lake)) +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
facet_wrap(~order,
ncol = 1,
scales = "free")
Shiny is an R package that enables the developer to create interactive web applications (apps).
My Example: https://zsmith.shinyapps.io/WQT_Shiny/
R Studio has a website dedicated to Shiny (https://shiny.rstudio.com/). There are a lot of resources available there, but for those just learning Shiny or looking for a refresher, I would direct you to the tutorial page (https://shiny.rstudio.com/tutorial/).
Shiny apps are mainly composed of three files: 1) ui.R, 2) server.R, and 3) global.R.
In this file, you will specify the aesthetic aspects of the app. This includes the presence/absence of a navigation bar, the presence and position of a dropdown menu, the presence and position of a slider bar, and the location of figures created in the server.R file.
In this file, you will specify reactive functions that respond to user inputs. For example, an app may contain a dropdown menu of sample sites and a scatterplot representing the site selected in the dropdown menu. Each time the user selects a new sample site from the dropdown menu, the scatterplot updates.
In this file, you should include code that is static. This includes: loading libraries, sourcing functions, and potentially importing data. These activities are intended to occur once and will not be reacting to user inputs.
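To make the roles of these three files concrete, here is a minimal hypothetical app, shown in single-file form for brevity; the commented sections correspond to global.R, ui.R, and server.R. All data and input names are made up for illustration.

```r
library(shiny)

# global.R equivalent: static code that runs once
# (load packages, source functions, import data).
sites.df <- data.frame(site = c("A", "A", "B", "B"),
                       x = c(1, 2, 3, 4),
                       y = c(2, 4, 1, 3))

# ui.R equivalent: the layout, inputs, and output placeholders.
ui <- fluidPage(
  selectInput("site", "Sample site:", choices = unique(sites.df$site)),
  plotOutput("scatter")
)

# server.R equivalent: reactive code that responds to user inputs.
server <- function(input, output) {
  output$scatter <- renderPlot({
    # Re-runs each time input$site changes.
    sub.df <- sites.df[sites.df$site == input$site, ]
    plot(sub.df$x, sub.df$y)
  })
}

# Running shinyApp(ui, server) at the console launches the app.
app <- shinyApp(ui, server)
```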
Shiny projects generally grow rapidly and it can become difficult to navigate hundreds of lines of code in a ui.R or server.R file. My preference is to break the code up into independent and more manageable scripts that are sourced in the ui.R or server.R files. For example, imagine you are developing an app which contains two tabs, one dedicated to tabular data and one dedicated to an interactive map. I would develop separate R scripts for server code associated with each tab. Similarly, I would create separate R scripts for the ui code associated with each tab. These files are stored in the appropriate folders labeled either “ui” or “server.” When sourcing files in a shiny app you must specify “local = TRUE”.
source("server/select_import_server.R", local = TRUE)
Most Shiny apps will require multiple R packages. I recommend loading all of the necessary R packages in the global.R file. This makes it simple to identify all of the packages you must have installed locally to edit and develop a given Shiny app. One way to simplify this task is to use the example provided by the following link: https://www.r-bloggers.com/install-and-load-missing-specifiedneeded-packages-on-the-fly/.
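A common version of that install-and-load template looks roughly like the sketch below; the package names are placeholders and should be adjusted to match your app.

```r
# Hypothetical vector of the packages a given app depends on.
pkg.vec <- c("shiny", "dplyr", "ggplot2", "DT")

# Identify packages that are not yet installed locally...
new.pkg <- pkg.vec[!pkg.vec %in% installed.packages()[, "Package"]]

# ...install any that are missing...
if (length(new.pkg) > 0) install.packages(new.pkg)

# ...and then load everything.
invisible(lapply(pkg.vec, library, character.only = TRUE))
```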
This makes it easier to collaborate with others or work on multiple computers.
*Link: https://rstudio.github.io/DT/
The R package, DT, is a great resource for creating interactive tables.
*Link: https://rstudio.github.io/dygraphs/
The R package, dygraphs, is a great resource for creating interactive time series plots.
*Link: http://rstudio.github.io/leaflet/
The R package, leaflet, is a great resource for creating interactive maps. When using this package in Shiny, there are a few steps you need to take to make the map function well (http://rstudio.github.io/leaflet/shiny.html). It is generally useful to create a leafletProxy, which allows the base map to load once as Shiny output. You can then use reactive functions to update the points presented on the map. With leafletProxy, only the map points are updated and the base map remains unchanged. This prevents the need to reload the entire map each time the points are updated, which would make the map appear to flash.
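A hedged sketch of the leafletProxy() pattern follows; the coordinates, lake names, and column names are made up for illustration.

```r
library(shiny)
library(leaflet)

# Hypothetical site coordinates.
sites.df <- data.frame(lake = c("onon", "onon", "caz"),
                       lng = c(-76.24, -76.22, -75.87),
                       lat = c(43.09, 43.11, 42.93))

ui <- fluidPage(
  selectInput("lake", "Lake:", choices = unique(sites.df$lake)),
  leafletOutput("map")
)

server <- function(input, output) {
  # The base map is rendered once as Shiny output.
  output$map <- renderLeaflet({
    leaflet() %>%
      addTiles()
  })

  # leafletProxy() updates only the markers when input$lake changes;
  # the base map is not reloaded, so the map does not "flash".
  observe({
    sub.df <- sites.df[sites.df$lake == input$lake, ]
    leafletProxy("map") %>%
      clearMarkers() %>%
      addCircleMarkers(lng = sub.df$lng, lat = sub.df$lat)
  })
}

# Running shinyApp(ui, server) at the console launches the app.
app <- shinyApp(ui, server)
```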
*Link: https://plot.ly/r/
The R package, plotly, is a great resource for creating interactive figures.
*Link: http://www.shinyapps.io/
shinyapps.io is a Shiny hosting platform provided by R Studio. Users must create a free or paid shinyapps.io account. The free account limits the number of applications you can publish and the number of hours apps can be active per month. There are multiple tiers of paid accounts: as the user pays more, they can publish a greater number of applications, more active hours are available per month, and other additional benefits are provided by R Studio.
Click on the “Publish to Server” button located in the top right corner of the source pane.
A “Deploy” tab will appear in the console pane, which will inform you that R Studio is working on publishing your app. This will take a few minutes. If there are any issues, the app will stop deploying and you will receive an error message.